Translations of the Callhome Egyptian Arabic Corpus for Conversational Speech Translation

Authors

  • Gaurav Kumar
  • Yuan Cao
  • Ryan Cotterell
  • Chris Callison-Burch
  • Daniel Povey
  • Sanjeev Khudanpur
Abstract

Translation of the output of automatic speech recognition (ASR) systems, also known as speech translation, has received a lot of research interest recently. This is especially true for programs such as DARPA BOLT which focus on improving spontaneous human-human conversation across languages. However, this research is hindered by the dearth of datasets developed for this explicit purpose. For Egyptian Arabic-English, in particular, no parallel speech-transcription-translation dataset exists in the same domain. In order to support research in speech translation, we introduce the Callhome Egyptian Arabic-English Speech Translation Corpus. This supplements the existing LDC corpus with four reference translations for each utterance in the transcripts. The result is a three-way parallel dataset of Egyptian Arabic speech, transcriptions and English translations.

Similar resources

Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus

Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish-English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callho...

Microsoft Speech Language Translation (MSLT) Corpus: The IWSLT 2016 release for English, French and German

We describe the Microsoft Speech Language Translation (MSLT) corpus, which was created in order to evaluate end-to-end conversational speech translation quality. The corpus was created from actual conversations over Skype, and we provide details on the recording setup and the different layers of associated text data. The corpus release includes Test and Dev sets with reference transcripts for sp...

Multilingual Speech Recognition: The 1996 Byblos Callhome System

This paper describes the 1996 Byblos Callhome speech recognition system for Spanish and Egyptian Colloquial Arabic. The system uses a combination of Phonetically Tied-Mixture Gaussian HMMs and State-Clustered Tied-Mixture Gaussian HMMs in a multiple pass decoder. We focus here on the aspects of the system which are language specific and demonstrate the adaptability of the Byblos English system to...

The BBN Byblos 1997 large vocabulary conversational speech recognition system

This paper presents the 1997 BBN Byblos Large Vocabulary Speech Recognition (LVCSR) system. We give an outline of the algorithms and procedures used to train the system, describe the recognizer configuration and present the major technological innovations that lead to performance improvements. The major testbed we present our results for is the Switchboard Corpus, where current word error rates...

2000 NIST Evaluation of Conversational Speech Recognition over the Telephone: English and Mandarin Performance Results

This paper documents the use of conversational telephone speech test materials in the NIST coordinated evaluation conducted early in 2000. The primary evaluation was of General American English speech, but a subsidiary evaluation of Mandarin speech was also offered. The primary test data consisted of twenty conversations collected for the original Switchboard Corpus but not released with the pu...

Publication date: 2014